ABSTRACT

We present a novel algorithm, Westfall-Young light, for
detecting patterns, such as itemsets and subgraphs, which
are statistically significantly enriched in one of two classes.
Our method corrects rigorously for multiple hypothesis
testing and correlations between patterns through the
Westfall-Young permutation procedure, which empirically
estimates the null distribution of pattern frequencies in each
class via permutations.

In our experiments, Westfall-Young light dramatically
outperforms the current state-of-the-art approach in terms of
both runtime and memory efficiency on popular real-world
benchmark datasets for pattern mining. The key to this
efficiency is that, unlike all existing methods, our algorithm
neither needs to solve the underlying frequent itemset
mining problem anew for each permutation nor needs to store
the occurrence list of all frequent patterns. Westfall-Young
light opens the door to significant pattern mining on large
datasets that previously led to prohibitive runtime or
memory costs.

1. INTRODUCTION 

Frequent pattern mining is one of the fundamental problems
in data mining. In its most general form, one is
given a database of transactions, each of which includes a
set of items. A frequent pattern is then a set of items that
co-occur in the same transactions more often than a predefined
frequency threshold. Frequent pattern mining is at the heart
of important problems such as association rule mining.

While the classic problem assumes that there is only one
class of transactions, an important extension is to consider
the case that two classes of transactions are given and one is
interested in finding those sets of items that occur
statistically significantly more often in one class than in the
other. This problem of significant pattern mining is of
fundamental importance to many applications of pattern mining,
such as subgraph, substring and itemset mining:

Subgraph mining: Given a collection of graphs, we aim
to discover subgraphs which help separate the two classes
of graphs. For instance, graphs may represent drugs, which are
either active or inactive regarding their ability to bind a
specific target. Then, one would seek subgraphs, corresponding
to molecular motifs, associated with drug activity [23].


Substring mining: Similarly, objects could be labeled
strings, leading to the problem of finding substrings which
are characteristic of only one of the two string classes. For
example, one might be looking for k-mers of DNA nucleotides
which are enriched in a class of DNA strings known to
contain binding sites for a protein of interest [20].

Itemset mining: Alternatively, in a labeled transaction
database, one might be interested in identifying subsets of
items whose occurrence is linked to the transaction class.
One application of this setting is searching for higher-order
interactions of binary predictors such as combinatorial
regulation of gene expression by binding of transcription
factors (TFs). There, items correspond to binary features
(TFs) and each transaction in the database is the subset of
binary features taking value one for a specific realization of
the binary predictors, labeled with the corresponding gene
expression level (low or high).

Many methods for finding patterns that are associated
with class membership have been proposed in the literature
(e.g., [13]; a comprehensive survey is also available),
including several that provide a measure of statistical
significance between the occurrence of a pattern and class
membership (e.g., [31]). However, none of the aforementioned
methods considers the inherent multiple hypothesis
testing problem in significant pattern mining. It arises due
to the combinatorial explosion in the number of patterns
that one tests for significant association with class
membership, which can lead to millions of patterns being deemed
significant by mistake. Our goal in this article is to propose
a method for significant pattern mining which can correct
for multiple testing.

The multiple hypothesis testing problem is that, given a
large enough number of tests, significant but false
discoveries can be made with high probability. In pattern
mining, where often billions of tests are being performed, it is
very likely that millions of associations between patterns and
class labels will appear to be statistically significant simply
due to chance.

Such false positive findings (patterns) are a problem for
disciplines such as medical research, biology, neuroscience,
psychology, or social sciences that use techniques of pattern
mining for choosing patterns for further experimental
investigation. Here, the cost of false positives is enormous,
resulting in many hours of research and resources wasted. As
a result, there is a pressing need to develop pattern mining
algorithms which can rigorously correct for multiple testing
and avoid false positives.

In this paper, we introduce the first algorithm that is
able to solve the multiple testing problem in significant
pattern mining settings optimally and with attractive time and
space requirements. By "optimal" we refer to achieving the
maximum possible number of true discoveries while strictly
upper-bounding the probability of false discoveries by a
predefined threshold. That is, we correct for multiple testing
by tightly controlling the Family-Wise Error Rate (FWER,
whose formal definition is given in the next section) without
imposing any limit on the maximum pattern size.

The paper is organized as follows. Section 2 describes
the multiple hypothesis testing problem in detail, discusses
the current state-of-the-art in significant pattern mining and
introduces the necessary concepts for our proposal.
Section 3 introduces our contribution both from a theoretical
and algorithmic perspective. Section 4 presents an
experimental study which empirically shows that our method
outperforms existing techniques in two different setups of
itemset mining and subgraph mining. For completeness,
a subsequent section briefly describes other algorithms for
significant pattern mining with sub-optimal FWER control.
Finally, the concluding section summarizes our key findings.

2. BACKGROUND 

2.1 The Multiple Hypothesis Testing Problem 

In statistical association testing, the dependence of each
pattern on the class labels is quantified using its
corresponding p-value. In this context, the p-value is defined as
the probability of measuring an association at least as extreme
as the one observed in the data under the assumption that
there is no real association present in the data-generating
process. More precisely, let $H_0^{(i)}$ be the null hypothesis that
the $i$-th pattern being examined is statistically independent
of the class labels, $T^{(i)}$ the test statistic chosen to evaluate
association and $t^{(i)}$ the observed value of $T^{(i)}$. Assuming
that larger values of $t^{(i)}$ indicate stronger dependence, the
p-value is defined as $p_i = \Pr(T^{(i)} \geq t^{(i)} \mid H_0^{(i)})$.
In practice, a pattern is considered to be significant if
$p_i \leq \alpha$, where $\alpha$ is called a significance threshold.

When only one pattern is being tested, the significance
threshold $\alpha$ coincides with the type I error probability, i.e.,
the probability of incorrectly rejecting a true null hypothesis.
On the other hand, if we have a large number of patterns
under study, statistical association testing becomes
considerably more challenging because of the multiple hypothesis
testing problem. If there is a large number $D$ of patterns and
those which satisfy $p_i \leq \alpha$ are ruled as significant, then the
probability that at least one out of the $D$ patterns will be
wrongly deemed to be associated with the labels converges
to 1 at an exponential rate. That probability, commonly
referred to as the Family-Wise Error Rate (FWER), can
be written mathematically as $\mathrm{FWER} = \Pr(\mathrm{FP} > 0)$, where
FP denotes the number of false positives incurred by the
testing procedure, i.e., the number of patterns erroneously
considered to be dependent on the labels.
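
To give a feel for how fast the FWER approaches 1, the following minimal sketch assumes, purely for illustration, that all D null hypotheses are true and that the tests are independent, in which case FWER = 1 - (1 - alpha)^D. Independence does not hold in pattern mining, and the function name is hypothetical.

```python
def fwer_independent(alpha: float, num_tests: int) -> float:
    """FWER when all nulls are true and the tests are independent (an assumption)."""
    return 1.0 - (1.0 - alpha) ** num_tests


if __name__ == "__main__":
    # The probability of at least one false positive grows quickly with D.
    for d in (1, 10, 100, 1000):
        print(d, round(fwer_independent(0.05, d), 4))
```

Under this simplified model, a few hundred uncorrected tests at alpha = 0.05 already make at least one false positive almost certain.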

When designing a multiple hypothesis testing procedure,
upper bounding the FWER by a predefined value $\alpha$ is the
most popular approach. Note that, from a data mining
perspective, $\alpha$ must never be considered as a parameter to be
optimized but, rather, as an application-dependent
requirement set before observing the data. Intuitively, $\alpha$ controls
the amount of "risk" the user is willing to take when
searching for patterns in the data. Larger values of $\alpha$ correspond
to more risk but, also, to an enhanced ability to discover real
signals.

Then, the ideal problem to be solved is

$$\delta^* = \max \{\, \delta \mid \mathrm{FWER}(\delta) \leq \alpha \,\},$$

where $\delta$ is the corrected significance threshold to determine
which patterns are statistically significant and $\mathrm{FWER}(\delta)$ is
the resulting FWER of that procedure. Nonetheless, since
evaluating $\mathrm{FWER}(\delta)$ is a costly task, most methods use
some quantity as a surrogate upper bound of the FWER
instead. For instance, one of the most popular multiple
testing corrections, the Bonferroni correction, defines
$f(\delta) = \delta D \geq \mathrm{FWER}(\delta)$. However, in data mining
settings, it is intuitively clear that many patterns
will in fact be contained in their super-patterns. Hence
complex correlation structures involving the $D$ different test
statistics will exist. Since the surrogate used by the
Bonferroni correction completely neglects the dependence between
test statistics, the resulting upper bound is quite loose, i.e.,
$\mathrm{FWER}(\delta) \ll \delta D$. That leads to over-conservative
procedures with a significant loss of statistical power. It is
therefore of great interest to develop algorithms to solve the
original problem without using any upper bound on the FWER
as a surrogate.

2.2 Westfall-Young Permutation-based 
Hypothesis Testing 

By definition, the FWER can be expressed as $\mathrm{FWER}(\delta) =
\mathrm{CDF}_{\Omega}(\delta)$, where $\Omega$ is the random variable
$\Omega = \min_{i \in K} \{\, p_i \mid H_0^{K} \,\}$, $H_0^{K} = \bigcap_{i \in K} H_0^{(i)}$
and $K$ is the set of true null hypotheses.
Under a certain technical condition known as the subset
pivotality condition, the FWER can also be obtained as the
CDF of the random variable $\Omega' = \min_{i \in \{1,\ldots,D\}} \{\, p_i \mid H_0 \,\}$,
with $H_0 = \bigcap_{i \in \{1,\ldots,D\}} H_0^{(i)}$, which is simpler as it does not
depend on the (unknown) ground truth. Since most often
there is no analytic formula available for the distribution
of $\Omega'$ nor its CDF, one resorts to resampling methods to
approximate it.

In the context of statistical association testing, a
permutation-based resampling scheme proposed by Westfall and
Young is one of the most popular approaches to
account for correlation structures. The idea is as follows: if
one has to test the association of $D$ random variables
$\{S_i\}_{i=1}^{D}$ with another random variable $Y$, a simple way of
generating samples of the p-values under the global null (i.e., to
generate samples of $\Omega'$) is by randomly shuffling (permuting)
the observations of $Y$ with respect to those of the random
variables $\{S_i\}_{i=1}^{D}$. The permuted datasets obtained that way
are effective samples of the distribution $\Pr(\{S_i\}_{i=1}^{D}) \Pr(Y)$
and, therefore, correspond to samples obtained from the
global null hypothesis that all variables $\{S_i\}_{i=1}^{D}$ are
independent of $Y$. Those can then be used to obtain samples
from $\mathrm{PDF}(\Omega')$ by computing the respective p-values and
taking the minimum across all patterns in the database,
i.e., $p_{\min} = \min_{i \in \{1,\ldots,D\}} \tilde{p}_i \sim \mathrm{PDF}(\Omega')$, where $\tilde{p}_i$ is the
p-value resulting from evaluating the association between $S_i$
and one realization of the permuted class labels $\tilde{Y}$.

While conceptually simple, the Westfall-Young
permutation scheme can be extremely computationally demanding.
Generating a single sample from $\mathrm{PDF}(\Omega')$ naively
requires computing the p-values corresponding to all $D$
patterns using a specific permuted version of the class labels,
$\tilde{Y}$. Besides, to get a reasonably good empirical approximation
to $\mathrm{FWER}(\delta) = \mathrm{CDF}_{\Omega'}(\delta)$, one needs several samples
$p_{\min} \sim \mathrm{PDF}(\Omega')$. As will be shown in the experiments
section, a number of samples in the order of J = 10^ or 
J — 10* is needed to get reliable results. 
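
The brute-force procedure just described can be sketched as follows. This is a minimal illustration rather than the authors' implementation: it assumes the pattern occurrences are already given as 0/1 vectors, uses the one-tailed Fisher's exact test detailed in Section 2.3 as the test statistic, and all function names are hypothetical.

```python
import random
from math import comb


def fisher_pvalue(a, x, n, N):
    """One-tailed Fisher p-value: the smaller of the two hypergeometric tails."""
    lo, hi = max(0, x + n - N), min(n, x)
    pmf = {k: comb(n, k) * comb(N - n, x - k) / comb(N, x) for k in range(lo, hi + 1)}
    left = sum(p for k, p in pmf.items() if k <= a)
    right = sum(p for k, p in pmf.items() if k >= a)
    return min(left, right)


def naive_westfall_young(patterns, labels, num_perm, seed=0):
    """Return num_perm samples of the minimum p-value under permuted labels.

    patterns: list of 0/1 lists (one list per pattern, one entry per object).
    labels:   list of 0/1 class labels, one per object.
    """
    rng = random.Random(seed)
    N, n = len(labels), sum(labels)
    samples = []
    for _ in range(num_perm):
        perm = labels[:]
        rng.shuffle(perm)                      # one realization of the permuted labels
        p_min = 1.0
        for s in patterns:                     # all D patterns are re-tested every time
            x = sum(s)
            a = sum(si * yi for si, yi in zip(s, perm))
            p_min = min(p_min, fisher_pvalue(a, x, n, N))
        samples.append(p_min)
    return samples
```

The nested loop makes the cost explicit: every permutation re-tests every pattern, which is what makes the naive scheme impractical when D is large.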

2.3 State-of-the-art in Significant Pattern 
Mining 

Terada et al. were the first to propose an efficient
algorithm, called FastWY, which enables the use of
Westfall-Young permutations in the setting of itemset mining. The
method uses inherent properties of discrete test statistics
and succeeds in reducing the computational burden which
the Westfall-Young permutation-based procedure entails. In
the following we introduce the algorithm, since we share the
same problem setting and the key concept of the minimum
attainable p-value.

In pattern mining, every p-value can be obtained from a 
2x2 contingency table: 


Variables    | S_i = 1    | S_i = 0            | Row totals
Y = 1        | a_i        | n - a_i            | n
Y = 0        | x_i - a_i  | N + a_i - n - x_i  | N - n
Col totals   | x_i        | N - x_i            | N


Such tables are generally used to evaluate the strength of
the association between two binary variables. In our case,
those correspond to the class labels $Y$ and a variable $S_i$ that
indicates whether the $i$-th pattern is present or not for each
of the $N$ objects in the database ($S_i = 1$ meaning that the
$i$-th pattern exists in the object). In a given database, $N$
denotes the total number of objects, $n$ the number of objects
with class label $Y = 1$, $x_i$ the number of objects containing
the $i$-th pattern, and $a_i$ the number of objects with class
label $Y = 1$ containing the $i$-th pattern. Without loss of
generality, we assume that the class labels are encoded so
that $n \leq N - n$.

Let us use Fisher's exact test to derive the test statistic,
as it is one of the most popular methods to obtain a
p-value out of a 2x2 contingency table. Fisher's exact test
assumes that all marginals $x_i$, $n$, and $N$ are fixed. Thus, the
table has only one degree of freedom left and one can model
it as a one-dimensional count $a_i$. Under those assumptions,
it can be shown that the underlying data-generating model
under the null hypothesis of statistical independence
between $S_i$ and $Y$ is a hypergeometric random variable.

Mathematically, the probability is
$\Pr(a_i = k \mid x_i, n, N) = \binom{n}{k} \binom{N-n}{x_i-k} / \binom{N}{x_i}$
with $a_{i,\min} \leq k \leq a_{i,\max}$, where $a_{i,\min} =
\max\{0, x_i + n - N\}$ and $a_{i,\max} = \min\{n, x_i\}$. If we observe
$a_i = \gamma$, the corresponding one-tailed p-value is obtained
as $p_i(\gamma) = \min\{\Phi_l(\gamma), \Phi_r(\gamma)\}$, where we define
$\Phi_l(\gamma) = \sum_{k=a_{i,\min}}^{\gamma} \Pr(a_i = k \mid x_i, n, N)$ and
$\Phi_r(\gamma) = \sum_{k=\gamma}^{a_{i,\max}} \Pr(a_i = k \mid x_i, n, N)$
as the left and right tails of the hypergeometric
distribution, respectively. To get a two-tailed p-value, one
can simply double the one-tailed p-value.
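
As a small worked example (the numbers are a hypothetical toy case, not taken from the paper), consider $N = 8$, $n = 3$ and $x_i = 3$, so that $a_{i,\min} = 0$ and $a_{i,\max} = 3$:

```latex
\begin{align*}
\Pr(a_i = k \mid x_i = 3, n = 3, N = 8)
  &= \binom{3}{k}\binom{5}{3-k} \Big/ \binom{8}{3}
   = \tfrac{10}{56},\ \tfrac{30}{56},\ \tfrac{15}{56},\ \tfrac{1}{56}
   \quad \text{for } k = 0, 1, 2, 3, \\
\Phi_l(2) &= \tfrac{10 + 30 + 15}{56} = \tfrac{55}{56}, \qquad
\Phi_r(2) = \tfrac{15 + 1}{56} = \tfrac{16}{56}, \\
p_i(2) &= \min\{\Phi_l(2), \Phi_r(2)\} = \tfrac{2}{7}.
\end{align*}
```

Doubling this one-tailed value gives a two-tailed p-value of $4/7$ for the same table.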

Since $a_i$ can only take $a_{i,\max} - a_{i,\min} + 1$ different
values, the corresponding p-values $p_i(a_i)$ are also finitely many.
Hence, there exists a minimum attainable p-value defined as
$\Psi(x_i, n, N) = \min\{\, p_i(\gamma) \mid a_{i,\min} \leq \gamma \leq a_{i,\max} \,\}$. As $n$ and
$N$ depend only on the labels and are therefore fixed for all
patterns, we abbreviate $\Psi(x_i, n, N)$ as $\Psi(x_i)$ in the rest of
the paper. Related to the minimum attainable p-value, we
also introduce the set of testable patterns at corrected
significance level $\delta$, $\mathcal{I}_T(\delta) = \{\, i \in \{1, \ldots, D\} \mid \Psi(x_i) \leq \delta \,\}$. Thus
patterns not in $\mathcal{I}_T(\delta)$ can never be significant at level $\delta$.

The key to the FastWY algorithm is to exploit this minimum
attainable p-value, which the algorithm requires to be
monotonically decreasing in $[0, N]$. To meet this requirement,
FastWY uses a surrogate lower bound $\hat{\Psi}(x_i)$ such that

$$
\hat{\Psi}(x_i) =
\begin{cases}
\Psi(x_i) & 0 \leq x_i \leq n, \\
1 / \binom{N}{n} & n < x_i \leq N,
\end{cases}
$$

and defines $\hat{\mathcal{I}}_T(\delta) = \{\, i \in \{1, \ldots, D\} \mid \hat{\Psi}(x_i) \leq \delta \,\}$, which
always satisfies $\mathcal{I}_T(\delta) \subseteq \hat{\mathcal{I}}_T(\delta)$. In other words, the set of
patterns they retrieve might contain some unnecessary patterns.
Nonetheless, as a result of the monotonicity of $\hat{\Psi}$, we can
rewrite $\hat{\mathcal{I}}_T(\delta)$ as $\hat{\mathcal{I}}_T(\sigma(\delta)) = \{\, i \in \{1, \ldots, D\} \mid x_i \geq \sigma(\delta) \,\}$
using $\sigma(\delta) = \inf\{\, x_i \mid \hat{\Psi}(x_i) \leq \delta \,\}$. This new formulation
is important because $\hat{\mathcal{I}}_T(\sigma(\delta))$ can be readily seen to
correspond to an instance of frequent pattern mining with support
$\sigma(\delta)$, a well-studied problem in data mining.

That property is the basis of FastWY as follows: if the
p-values for all patterns in $\hat{\mathcal{I}}_T(\sigma)$ are known and $p^{\sigma}_{\min} =
\min\{\, p_i \mid i \in \hat{\mathcal{I}}_T(\sigma) \,\} \leq \hat{\Psi}(\sigma)$, no pattern in $\{1, \ldots, D\} \setminus
\hat{\mathcal{I}}_T(\sigma)$ can possibly attain a p-value smaller than $p^{\sigma}_{\min}$. Hence
$p_{\min} = p^{\sigma}_{\min}$ and the search can be stopped early. That can
be exploited to sample from $\mathrm{PDF}(\Omega')$ without having to
explicitly compute all $D$ p-values.

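FastWY is based on a decremental search scheme starting
with the support $\sigma = n$. For each $\sigma$, first a frequent
pattern miner is used as a black box to retrieve the set $\hat{\mathcal{I}}_T(\sigma)$.
Then, the p-values $p_i$ are computed for all $i \in \hat{\mathcal{I}}_T(\sigma)$ and
$p^{\sigma}_{\min} = \min\{\, p_i \mid i \in \hat{\mathcal{I}}_T(\sigma) \,\}$ is evaluated. If $p^{\sigma}_{\min} \leq \hat{\Psi}(\sigma)$,
no other pattern can make $p^{\sigma}_{\min}$ smaller and, therefore, $p_{\min} =
p^{\sigma}_{\min}$ constitutes an exact sample from $\mathrm{PDF}(\Omega')$. Otherwise,
if $p^{\sigma}_{\min} > \hat{\Psi}(\sigma)$, $\sigma$ is decreased by one unit and the whole
procedure is repeated until the condition $p^{\sigma}_{\min} \leq \hat{\Psi}(\sigma)$ is
satisfied.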

If $J$ permutations are needed to empirically estimate the
FWER, the procedure has to be repeated $J$ times. That
includes the whole sequence of frequent mining problems
needed to retrieve the sets $\hat{\mathcal{I}}_T(\sigma)$ for each support value $\sigma$
used throughout the decremental search. Given the usual
range of values for $J$, such an approach is just as infeasible
in practice as the original brute-force baseline.

Interestingly, careful inspection of the code kindly shared
by the authors on their website reveals that the actual
implementation of the algorithm is different from the description
in their paper. Indeed, to alleviate the inadmissible burden of
repeating the whole frequent pattern mining process $J$ times,
the authors resort to storing the realizations of the variables
$S_i$ for all $i \in \hat{\mathcal{I}}_T(\sigma)$ and every value of $\sigma$ explored during the
decremental search. Clearly, this decision corresponds to a
drastic trade-off between runtime and memory usage since,
as we will show empirically, there exist many databases for
which the amount of information to be stored when following
that approach is too large to fit even in large servers
equipped with 512 GB of RAM. Furthermore, we will also
demonstrate that the algorithm resulting from trading off
memory in exchange for runtime is still slow even for
mid-sized datasets and can be greatly improved.

When referring to the FastWY algorithm throughout
the remainder of this paper, we mean the runtime-optimized
version of the algorithm, given that the algorithm
exactly as described in [26] simply cannot be run in a
reasonable amount of time.


3. THE WESTFALL-YOUNG LIGHT 
APPROACH 

As discussed previously, despite being an interesting idea,
several properties of the FastWY algorithm do not allow
its application to large datasets. The main reasons why
the state-of-the-art algorithm is inefficient are: (1) it uses
a decremental search strategy, which is known to be orders
of magnitude slower than incremental search (discussed in
more detail elsewhere in this paper); (2) naive application of
the method requires either repeating pattern mining $J$ times,
with $J$ large (see Section 2.2), or alternatively storing in
memory the list of occurrences among the $N$ objects in the
database of all frequent patterns; (3) $J$ exact samples from
$\mathrm{PDF}(\Omega')$ have to be computed, even though the calculation
of $\delta^*$ actually involves only the smallest $\lceil \alpha J \rceil$ samples;
(4) as a direct consequence of (3), the larger $J$, the more likely
it is that some of the samples of $\Omega'$ fall in the upper tail of
the distribution, hence requiring the mining of patterns with
unnecessarily low supports and significantly increasing both
the overall runtime and the memory requirements; (5) it relies
on using a surrogate lower bound $\hat{\Psi}(x_i)$ on the minimum
attainable p-value $\Psi(x_i)$, despite the fact that such a strategy
might consider some unnecessary patterns.

In this section, we show how to remove all the limitations
listed above, resulting in a novel procedure for
permutation-based significant pattern mining. This new algorithm
requires significantly less memory and can be shown to be up
to 3 orders of magnitude faster on real-world data.

We show first how to get rid of limitation (5). Then, after
introducing two previously unexploited key properties of the
FWER estimator used in permutation testing, we describe
how to get rid of limitations (1) up to (4).



3.1 Removing Limitation (5)

Discrete test statistics not only result in a finite set of
p-values but, similarly, they have finitely many minimum
attainable p-values. The exact minimum attainable p-value
function $\Psi(x_i)$ is given by:

$$
\Psi(x_i) =
\begin{cases}
\dfrac{n!}{N!} \, \dfrac{(N - x_i)!}{(n - x_i)!} & 0 \leq x_i < n, \\[2mm]
\dfrac{(N-n)!}{N!} \, \dfrac{x_i!}{(x_i - n)!} & n \leq x_i < \frac{N}{2}, \\[2mm]
\dfrac{(N-n)!}{N!} \, \dfrac{(N - x_i)!}{((N-n) - x_i)!} & \frac{N}{2} \leq x_i < N - n, \\[2mm]
\dfrac{n!}{N!} \, \dfrac{x_i!}{(x_i - (N - n))!} & N - n \leq x_i \leq N,
\end{cases}
\qquad (1)
$$

(see Figure 1). Note that $\Psi(x_i) = \Psi(N - x_i)$ always holds
for $0 \leq x_i \leq N/2$. The function $\Psi(x_i)$ takes $\lfloor N/2 \rfloor + 1$
different values in $[0, N]$. This in turn implies that, despite
depending on a real-valued parameter $\delta \in [0, 1]$, there are
only $\lfloor N/2 \rfloor + 1$ different sets of testable patterns $\mathcal{I}_T(\delta)$. If
we sort the range of $\Psi(x_i)$ as a monotonically increasing
sequence $\delta_{\lfloor N/2 \rfloor} < \ldots < \delta_1 < \delta_0 = 1$, it becomes
clear that $\mathcal{I}_T(\delta) = \mathcal{I}_T(\delta_k)$ for all $\delta \in [\delta_k, \delta_{k-1})$. More
importantly, for each threshold $\delta_k$ with $k = 0, \ldots, \lfloor N/2 \rfloor$,
there exists a corresponding region $\Sigma_k \subseteq [0, N]$ such that
$x_i \in \Sigma_k$ if and only if $i \in \mathcal{I}_T(\delta_k)$. Those regions can be
of two different types: (1) if $\delta_k < \Psi(\lfloor N/2 \rfloor)$, the region is
the union of two symmetric intervals, i.e., $\Sigma_k = [\sigma_l^k, \sigma_u^k] \cup
[N - \sigma_u^k, N - \sigma_l^k]$; (2) if $\delta_k \geq \Psi(\lfloor N/2 \rfloor)$, the region is composed
of a single interval, $\Sigma_k = [\sigma_l^k, N - \sigma_l^k]$. We have
introduced $\sigma_l^k$ and $\sigma_u^k$ as $\sigma_l^k = \min\{\, x \in [0, n] \mid \Psi(x) \leq \delta_k \,\}$ and
$\sigma_u^k = \max\{\, x \in [n, \lfloor N/2 \rfloor] \mid \Psi(x) \leq \delta_k \,\}$, respectively. Note
also that, regardless of the type, the regions $\Sigma_k$ are always
symmetric around $N/2$. Both types of regions are illustrated
with an example in Figure 1.

Figure 1: Exact minimum attainable p-value $\Psi(x_i)$ for $n =
10$ and $N = 50$ (blue dots). Two different types of regions
$\Sigma_j$ and $\Sigma_k$ are illustrated (red and black lines).

Finally, the last key observation is that $\Sigma_j \subseteq \Sigma_k$
whenever $j \geq k$. In other words, the corresponding regions
$\{\, x \mid \Psi(x) \leq \delta \,\}$ shrink monotonically with respect to $\delta$. That,
and not the strict monotonicity of $\Psi(x_i)$ as argued in [25],
is the actual property needed to implement the method in
[24] via branch-and-bound techniques. Indeed, recovering
$\mathcal{I}_T(\delta_k)$ amounts to an instance of frequent pattern mining
with support $\sigma_l^k$ in which frequent patterns whose support is
not in $\Sigma_k$ are discarded. Besides, given $\Sigma_k$ and $\delta_k$,
computing $\Sigma_{k+1}$ and $\delta_{k+1}$ has complexity $O(1)$, as we will discuss
later.


3.2 Removing Limitations (l)-(4) 

Now we show how, by exploiting properties of the FWER
estimator, it is possible to rearrange computations in a way
that allows: (1) using an incremental search scheme, so that
the frequent pattern mining algorithm needs to be run only
once instead of $J$ times, without any extra memory
requirements other than storing the binary $N$-by-$J$ matrix of
permuted class labels (limitations 1 and 2); and (2) generating
exactly only the $\lceil \alpha J \rceil$ smallest samples from $\mathrm{PDF}(\Omega')$,
significantly reducing the overall frequent
pattern mining effort (limitations 3 and 4).

Let $\{ p^{(j)}_{\min} \}_{j=1}^{J}$ be the set of minimum p-values for
each permutation, i.e., $J$ different exact samples from the
distribution $\mathrm{PDF}(\Omega')$. Then, the empirical estimator of the
FWER at corrected significance level $\delta$ is given as

$$\widehat{\mathrm{FWER}}(\delta) = \frac{1}{J} \sum_{j=1}^{J} \mathbf{1}\left[ p^{(j)}_{\min} \leq \delta \right],$$

where $\mathbf{1}[\cdot]$ denotes a function which evaluates to 1 if its
argument is true and to 0 otherwise. The following two
properties of that estimator are the theoretical basis of our
contribution:

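Property 1. Whenever a new pattern $i$ is processed, the
updated empirical FWER estimate can never decrease.

Proof. We have $\min\{ p_i \mid i \in \mathcal{I} \cup \{k\} \} = \min\{ \min\{ p_i \mid
i \in \mathcal{I} \}, p_k \}$ and hence $\mathbf{1}[\min\{ p_i \mid i \in \mathcal{I} \} \leq \delta] \leq \mathbf{1}[\min\{ p_i \mid
i \in \mathcal{I} \cup \{k\} \} \leq \delta]$. Thus this property readily follows. □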

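Property 2. $\widehat{\mathrm{FWER}}(\delta)$ for all $\delta \in [0, \delta_k]$ can be
evaluated exactly using only the p-values of patterns in $\mathcal{I}_T(\Sigma_k)$.

Proof. Let $p^{k}_{\min} = \min\{ p_i \mid i \in \mathcal{I}_T(\Sigma_k) \}$. By definition
of $\mathcal{I}_T(\Sigma_k)$, if $p^{k}_{\min} > \delta_k$ then $p_{\min} > \delta_k$, leading to
$\mathbf{1}[p_{\min} \leq \delta] = \mathbf{1}[p^{k}_{\min} \leq \delta]$ for every $\delta \leq \delta_k$. □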

By putting both properties together, in Section 3.4 we will
show how one can implement an incremental counterpart of
FastWY which removes all limitations (1) up to (4).
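
The empirical estimator and Property 1 can be made concrete with a short sketch. The helper names are hypothetical; the running minima are stored as a plain list with one entry per permutation.

```python
def empirical_fwer(p_min_samples, delta):
    """Fraction of permutations whose minimum p-value is at most delta."""
    return sum(p <= delta for p in p_min_samples) / len(p_min_samples)


def update_with_pattern(p_min_samples, pattern_pvalues):
    """Fold the J permutation p-values of one new pattern into the running minima."""
    return [min(m, p) for m, p in zip(p_min_samples, pattern_pvalues)]


if __name__ == "__main__":
    running = [1.0, 1.0, 1.0, 1.0]                    # J = 4, no pattern seen yet
    running = update_with_pattern(running, [0.2, 0.01, 0.5, 0.03])
    print(empirical_fwer(running, 0.05))              # estimate after one pattern
    running = update_with_pattern(running, [0.04, 0.9, 0.9, 0.9])
    print(empirical_fwer(running, 0.05))              # can only stay equal or grow
```

Because each running minimum can only decrease as patterns are added, the estimate at any fixed threshold can only grow, which is what allows the threshold to be tightened on the fly.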

3.3 Evaluating Fisher’s Exact Test p-values in 
Negligible Time 

Here we provide an efficient technique for computing
p-values of Fisher's exact test, which is also crucial for
improving the efficiency if $J$ is large.

Roughly speaking, one can divide the computational effort
required to apply the Westfall-Young permutation testing
procedure to data mining problems into 4 main categories:
(1) compute the matrix of permuted class labels; (2) for each
pattern, compute the cell counts for all $J$ permutations;
(3) for each pattern, compute the p-values corresponding to
the $J$ cell counts, $p_i(a_i^{(j)})$; and (4) the frequent pattern
mining itself. (1) can be considered to be negligible in most
large datasets and therefore is not a major concern. (4) will
be reduced to the bare minimum by using an incremental
search strategy. While that effort might still be considerable,
it appears difficult a priori to reduce it further. Finally, both
(2) and (3) have a complexity $O(x_i J)$ per pattern. Even if
that might be small, bearing in mind that the number of
patterns to be inspected can be in the order of trillions and
that $J$ is large, it could be an even more demanding task than
frequent pattern mining itself, depending on the dataset.
Nonetheless, by using a careful arrangement of the p-value
computations only made possible by incremental search, we
will show how to reduce the complexity of (3) from $O(x_i J)$
per pattern to $O(x_i)$ per pattern, leaving (2) and (4) as the
sole contributors to the overall runtime in our algorithm.
For that we exploit the following computational property of
Fisher's exact test:

Property 3. For fixed $x_i$, $n$, and $N$, the computational
complexity of evaluating Fisher's exact test p-value $p_i(\gamma)$
for a single value of $\gamma$ or for all possible values of $\gamma$ in
$[a_{i,\min}, a_{i,\max}]$ is the same and equal to $O(\min\{x_i, n\})$.

Proof. Let $p_i(\gamma) = \min(\Phi_l(\gamma), \Phi_r(\gamma))$, with $\Phi_l(\gamma)$ and
$\Phi_r(\gamma)$ as defined in Section 2.3. Computing $p_i(\gamma)$ for a
single value of $\gamma$ requires evaluating the probability mass of
a hypergeometric random variable with parameters $x_i$, $n$,
$N$ in all its support and computing the left and right tail
mass, with split point at $\gamma$. Hence, the complexity of
evaluating $p_i(\gamma)$ for a single fixed $\gamma$ is $O(\min\{x_i, n\})$. Note that
all of those computations can be shared across all $J$
permutations as long as $x_i$, $n$, $N$ are fixed. We can therefore
compute the p-values corresponding to all possible values of
$\gamma$ instead of just a single one also with the same complexity
$O(\min\{x_i, n\})$. □

To exploit this property, we use an enumeration scheme
which processes one pattern at a time but all $J$
permutations simultaneously, something which is not feasible with
the decremental scheme of FastWY. As a result, by using
Property 3 we share the computational burden of computing
p-values for a given pattern across all $J$ permutations,
in a way that the resulting complexity is the same as for
the case $J = 1$. In contrast, FastWY resorts to the strategy
of storing in memory all previously evaluated p-values to
avoid recomputing them many times. Again, this is an
implementation aspect not discussed in their paper but present
in their code, which represents another memory versus
runtime trade-off.
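
A minimal sketch of this sharing idea follows: the p-values for every attainable cell count are tabulated once per pattern, and each permutation then costs only a lookup. For clarity the sketch calls `math.comb` for each probability mass, whereas an optimized implementation would presumably use cheaper recurrences; the function names and data layout are assumptions made for illustration.

```python
from itertools import accumulate
from math import comb


def pvalue_table(x, n, N):
    """Map every attainable cell count to its one-tailed Fisher p-value."""
    a_min, a_max = max(0, x + n - N), min(n, x)
    denom = comb(N, x)
    pmf = [comb(n, k) * comb(N - n, x - k) / denom for k in range(a_min, a_max + 1)]
    left = list(accumulate(pmf))                       # Pr(a <= k) for each k
    right = list(accumulate(reversed(pmf)))[::-1]      # Pr(a >= k) for each k
    return {a_min + i: min(l, r, 1.0) for i, (l, r) in enumerate(zip(left, right))}


def pvalues_for_permutations(occurrence, permuted_labels):
    """Build the table once, then do one dictionary lookup per permutation."""
    N, x = len(occurrence), sum(occurrence)
    n = sum(permuted_labels[0])                        # identical for every permutation
    table = pvalue_table(x, n, N)
    counts = [sum(s * y for s, y in zip(occurrence, perm)) for perm in permuted_labels]
    return [table[a] for a in counts]
```

The table depends only on the pattern's support and on the fixed margins, so it is built once per pattern regardless of the number of permutations.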

3.4 The Algorithm: Westfall-Young Light 

Exploiting all those properties, we propose the following
algorithm, which removes limitations (1) up to (4) and is
equipped with an efficient computation of p-values.

First, precompute all $J$ sets of permuted class labels. Then,
initialize $\delta_k$, $\Sigma_k$, and $\sigma_l^k$ for $k = 1$, that is, the threshold to
the maximum (non-trivial) value $\delta_1 = n/N$, the associated
testable region $\Sigma_1 = [1, N - 1]$, and the minimum support
for frequent pattern mining $\sigma_l^1 = 1$. Note that we skipped
$k = 0$, since it corresponds to a trivial threshold $\delta_0 = 1$ for
which every pattern is testable and significant. Next, start
enumerating patterns using the frequent pattern miner of
choice and, every time a frequent pattern is found, do the
following: (1) check if its support $x_i$ satisfies $x_i \in \Sigma_k$ (i.e.,
check if the pattern is testable); if not, simply continue
mining; (2) if yes, precompute all possible p-values for a
hypergeometric random variable with parameters $x_i$, $n$ and $N$;
(3) compute the cell counts $a_i^{(j)}$ for all $j = 1, \ldots, J$ and fetch
the corresponding p-values $p_i^{(j)} = p_i(a_i^{(j)})$, updating $p^{(j)}_{\min}$ if
$p_i^{(j)} < p^{(j)}_{\min}$; (4) check if the current FWER is too high; if it
is, increase $k$ (thus reducing $\delta_k$, shrinking $\Sigma_k$, and increasing
the minimum support $\sigma_l^k$ for frequent pattern mining) until
the FWER is below the target $\alpha$ again; and (5) continue
running the frequent pattern miner without restarting.

This combined scheme of frequent pattern enumeration
and adaptive threshold adjustment continues until all
patterns for a certain $(k, \delta_k, \Sigma_k)$ have been enumerated. At
that point, one knows for sure that the optimal threshold
$\delta^*$ must lie in $[\delta_k, \delta_{k-1})$ and, despite not having all $J$
samples $p_{\min} \sim \mathrm{PDF}(\Omega')$, we do have all of the samples which
satisfy $p^{(j)}_{\min} < \delta_{k-1}$. Since $\delta^* < \delta_{k-1}$, those are all that is
needed to exactly obtain the optimal corrected significance
threshold $\delta^*$. On average, the number of samples involved
in that computation will be in the order of $\lceil \alpha J \rceil \ll J$.

The pseudocode is given in the algorithm listing. For simplicity,
we assume a generic frequent pattern miner which
enumerates all patterns in the form of a rooted tree, with children
having a support no larger than that of the parent. Due
to the removal of limitation (5), adapting the algorithm to
deal with the opposite situation, i.e., children having a
support not smaller than that of the parent, would be trivial.
Also, note that the effort required to update $\delta_k$ and $\Sigma_k$ is
absolutely negligible, especially if all values of $\Psi(x)$ for all
$x \in [0, N]$ are precomputed at startup.
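
The steps described in this section can be put together in a toy sketch for itemset mining. It is an illustration under several assumptions, not the authors' implementation: a naive depth-first enumeration stands in for a real frequent pattern miner, all p-value tables are precomputed for every support, a sentinel threshold of 0 is appended so the loop always terminates, and the function returns the lower end delta_k of the interval known to contain the optimal threshold rather than the optimal threshold itself. All names are hypothetical.

```python
import random
from itertools import accumulate
from math import comb


def pvalue_table(x, n, N):
    """One-tailed Fisher p-value for every attainable cell count (support x)."""
    lo, hi = max(0, x + n - N), min(n, x)
    pmf = [comb(n, k) * comb(N - n, x - k) / comb(N, x) for k in range(lo, hi + 1)]
    left = list(accumulate(pmf))
    right = list(accumulate(reversed(pmf)))[::-1]
    return {lo + i: min(l, r, 1.0) for i, (l, r) in enumerate(zip(left, right))}


def wy_light(transactions, labels, alpha=0.05, num_perm=200, seed=0):
    """Toy incremental search; returns (delta_k, running minimum p-values)."""
    rng = random.Random(seed)
    N, n = len(labels), sum(labels)
    perms = [rng.sample(labels, N) for _ in range(num_perm)]   # permuted labels
    tables = [pvalue_table(x, n, N) for x in range(N + 1)]
    psi = [min(t.values()) for t in tables]        # minimum attainable p-values
    # delta_0 = 1 > delta_1 > ...; the trailing 0.0 is a sentinel (nothing testable).
    deltas = sorted(set(psi), reverse=True) + [0.0]
    # Smallest testable support for each threshold (N + 1 if there is none).
    sigma = [next((s for s in range(N + 1) if psi[s] <= d), N + 1) for d in deltas]
    items = sorted(set().union(*transactions))
    p_min = [1.0] * num_perm
    k = 1

    def visit(tids, start):
        nonlocal k
        for idx in range(start, len(items)):
            sub = [t for t in tids if items[idx] in transactions[t]]
            x = len(sub)
            if x < sigma[k]:
                continue      # neither this pattern nor its children are testable
            if psi[x] <= deltas[k]:
                for j, perm in enumerate(perms):
                    a = sum(perm[t] for t in sub)             # cell count a_i^(j)
                    p_min[j] = min(p_min[j], tables[x][a])
                # Tighten the threshold until the empirical FWER is at most alpha.
                while sum(p <= deltas[k] for p in p_min) / num_perm > alpha:
                    k += 1
            visit(sub, idx + 1)   # children never have larger support

    visit(list(range(N)), 0)
    return deltas[k], p_min
```

Note that the enumeration is never restarted: raising k only makes the pruning stricter for the patterns that are still to be visited, and the running minima are exact only for those permutations whose minimum falls below the previous threshold.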

To summarize, our main contribution is the development
of what we believe to be the first practical algorithm to mine
statistically significant patterns while controlling the FWER
in an optimal sense, i.e., by estimating the empirical null
distribution to exactly compensate for the dependence structure
existing between test statistics. Compared to the current
state-of-the-art FastWY algorithm, our proposal improves
the following aspects:

1. It retrieves the exact set of testable itemsets instead
of a surrogate superset.

2. It uses an incremental search strategy, significantly
reducing the frequent pattern mining effort.

3. It does not need to store the occurrence list of each
frequent pattern in memory.

4. It does not need to compute the $\lfloor (1 - \alpha) J \rfloor$ samples
$p_{\min} \sim \mathrm{PDF}(\Omega')$ which lie in the upper tail of the
distribution. This improvement significantly reduces
the frequent pattern mining effort further. Besides,
the runtime related to the computation of cell counts
is also decreased.

5. It uses an efficient scheme to share the computation of
p-values across permutations, reducing the corresponding
runtime for that task by a factor of $J$. That also
avoids caching computations at a further expense of
memory usage.

Out of those 5 improvements, 2), 3) and 4) have the great¬ 
est overall impact and are all direct consequences of using 
Properties and to derive an incremental search scheme. 
Improvement 5) is independent of those and roughly halves 
the runtime in databases for which the computation of cell 
counts and p-values is the main bottleneck. On the contrary, 
it brings more modest speedups for datasets that have fre¬ 
quent pattern mining as the limiting runtime factor. Finally, 
improvement 1) is not directly related to computational effi¬ 
ciency, but refines the theoretical background and enhances 
the applicability of the method. 


4. EXPERIMENTS 

In this section, we demonstrate our five improvements
(see Section 3.4) across a wide range of scenarios with an
exhaustive experimental study in two representative data
mining problems: significant itemset mining and significant
subgraph mining.


4.1 Experimental Setup 

As discussed earlier, a literal implementation of the
FastWY algorithm described in [26] would lead to impractical
runtime requirements for virtually all datasets, leaving
us without a comparison partner. Thus we chose to employ
the method they actually implemented¹ as the sole comparison
partner.

¹Code available at http://a-terada.github.io/lamp/


Nonetheless, their code is not efficient enough to provide
a fair comparison, mainly because it is written in Python.
As our method has been carefully written in C/C++ with
as many fine optimizations as we could include, we reimplemented
FastWY from scratch in C/C++, trying to optimize
the code as much as that of our own algorithm. As a result,
our implementation of FastWY, while being the same algorithm,
is about 2 or 3 orders of magnitude faster and reduces
the amount of RAM used by one or two orders of magnitude.
Thus differences between our method and FastWY
which are due to implementation issues rather than to purely
algorithmic considerations were eliminated.

Itemset Mining Specific Setup: As a frequent itemset
miner, we chose LCM version 3 [27] for both Westfall-Young
light and FastWY. LCM has been shown to exhibit
state-of-the-art performance on a great number of datasets
and won the FIMI'04 frequent itemset mining competition
[12]. Even though the original implementation of FastWY
relies on LCM version 5, we chose LCM version 3 because
it is the fastest among all versions so far according to the
author of LCM. To that end, we modified the source code, adding the
missing features needed to keep track of the occurrence list
of each frequent itemset.

All the code was compiled using the Intel C++ compiler
version 14.0.1 with flags -O3 -xavx. Each instance was executed
as a single thread on a cluster computer, whose nodes
are equipped with up to 256 GB of RAM and two 2.7 GHz
Intel Xeon E5-2697v2 CPUs.

Subgraph Mining Specific Setup: In subgraph mining,
we employed Gaston as a frequent subgraph miner²
because it is reported to be one of the fastest algorithms [30].
Both Westfall-Young light and FastWY were integrated into
Gaston. All of the methods were written in C++ and compiled
with gcc 4.9.0 with the option -O3. We used Mac OS X
version 10.9.5 on a machine equipped with 32 GB of RAM and a 3.5 GHz
Intel Core i7-4771 CPU.

Itemset Mining Datasets: We used 4 labeled datasets:
TicTacToe³, Inetads⁴, Mushroom, and Breast cancer.
The first three are commonly studied datasets taken from
the UCI repository: Tic-tac-toe was binarized by representing
the three possible states (empty, "x" or "o") of each space
in the 3x3 grid with binary indicators; non-binary features
and features with missing values were discarded from
Inetads; Mushroom was processed as described in prior work. Finally,
Breast cancer is described in [26] and used there as an example
of a challenging dataset, despite the fact that, as we will
show, it is far from being the most demanding among our
own battery of tests.

In addition, we used 8 unlabeled datasets from the well-known
public benchmark datasets for frequent itemset mining:
Bmspos, BmsWebview, Retail, T10I4D100K, T40I10D100K,
Chess, Connect and Pumsb-star. Since the labels
themselves only affect the algorithm via $n$ and $N$ as
far as finding the corrected significance threshold $\delta^*$ is
concerned, we considered two representative cases: $N/n = 2$ or
$10$. Note that, since $\psi(x) \approx (N/n)^{-x}$ for $x \ll n$, the more
unbalanced the classes are, the larger the resulting testable

²Code available at http://www.liacs.nl/~snijssen/gaston/ices.html

³https://archive.ics.uci.edu/ml/datasets/Tic-Tac-Toe+Endgame

⁴https://archive.ics.uci.edu/ml/datasets/Internet+Advertisements










Table 1: Characteristics of itemset mining datasets. N and n are the number of transactions in total and in the minor class,
respectively, |E| refers to the number of items, and ||T||/N is the average transaction size. A dash marks the unlabeled datasets, for which N/n is set to 2 or 10.

Dataset          N         N/n      |E|      ||T||/N
TicTacToe        958       2.89     18       6.93
Chess            3196      -        75       37.00
Inetads          3279      7.14     1554     12.00
Mushroom         8124      2.08     117      22.00
Breast cancer    12773     11.31    1129     6.70
Pumsb-star       49046     -        7117     50.48
Connect          67557     -        129      43.00
BmsWebview       77512     -        3340     4.62
Retail           88162     -        16470    10.31
T10I4D100K       100000    -        870      10.10
T40I10D100K      100000    -        942      39.61
Bmspos           515597    -        1657     6.53


Table 2: Characteristics of subgraph mining datasets, where |V| and |E| denote the number of vertices and edges, respectively.

Dataset      N        N/n      avg.|V|    avg.|E|
PTC (MR)     584      3.23     31.96      32.71
PTC (FR)     584      3.74     31.96      32.71
PTC (MM)     576      3.18     31.47      32.18
PTC (FM)     563      3.15     31.78      32.50
MUTAG        188      2.98     17.93      39.59
ENZYMES      600      2.00     32.63      62.14
D&D          1178     2.42     284.32     715.66
NCI1         4208     2.00     60.12      62.72
NCI41        27965    17.23    47.97      50.15
NCI109       4256     2.00     59.48      62.09
NCI167       80581    8.38     39.70      41.05
NCI220       900      3.10     46.87      48.52


region will be, resulting in more testable patterns and increased
computational demands. This results in a total of
20 different cases to be tested. We summarize the main
properties of each dataset in Table 1.
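
The dependence of the minimum attainable p-value on the class ratio can be illustrated with a short sketch. It assumes a one-sided Fisher's exact test and supports $x \le n$, in which case the minimum attainable p-value is $\binom{n}{x} / \binom{N}{x}$; the function name `psi` and the example sizes are hypothetical and only meant to show why a larger $N/n$ makes more patterns testable.

```python
from math import comb


def psi(x, n, N):
    """Minimum attainable p-value of a one-sided Fisher's exact test for a
    pattern with support x, assuming 0 <= x <= n <= N (illustrative only)."""
    if not 0 <= x <= n <= N:
        raise ValueError("this sketch only covers 0 <= x <= n <= N")
    # All x occurrences fall in the minority class of size n.
    return comb(n, x) / comb(N, x)


if __name__ == "__main__":
    N = 1000
    for ratio in (2, 10):
        n = N // ratio
        exact = psi(5, n, N)
        approx = (N / n) ** -5  # the small-support approximation (N/n)^(-x)
        print(f"N/n = {ratio}: psi(5) = {exact:.3e}, approximation = {approx:.3e}")
```

For a fixed support, the more unbalanced case yields a much smaller minimum attainable p-value, so patterns with lower supports remain testable.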

Subgraph Mining Datasets: We used 12 labeled graph
datasets: four PTC (Predictive Toxicology Challenge) datasets⁵,
MUTAG, ENZYMES, D&D⁶, and five NCI (National
Cancer Institute) datasets⁷, where ENZYMES and D&D are
proteins and the others are chemical compounds. These datasets
are popular benchmarks and have been frequently used
in previous studies (e.g., [16, 21]). Graph nodes are labeled in
all datasets and edges are also labeled except for ENZYMES
and D&D. In the four PTC datasets, graphs labeled as CE,
SE, or P were treated as positive and those labeled NE or N as
negative, the same setting as in [14]. Properties of these
datasets are summarized in Table 2.

Note that the number of nodes in subgraphs is bounded
by 15 in NCI1, NCI109, and NCI220, by 10 in MUTAG,
NCI41, and NCI167, and by 8 in ENZYMES so that the comparison
partner, FastWY, can finish in a reasonable time to
check its peak memory consumption. For example, in ENZYMES
with a maximum subgraph size of 10, our method
took 3.6 hours while FastWY had not finished after two weeks.
In D&D and the four PTC datasets, the size of subgraphs
is unlimited.


4.2 Results 


Runtime and memory usage 

As the main result, we compare the runtime and memory
usage of our method Westfall-Young light and the comparison
partner FastWY for all 20 itemset mining cases and 12
subgraph mining datasets. In both algorithms, the number
$J$ of permutations is the only parameter. We set $J = 10^4$ for

⁵http://www.predictive-toxicology.org/ptc/

⁶MUTAG, ENZYMES, and D&D are obtained from
http://mlcb.is.tuebingen.mpg.de/Mitarbeiter/Nino/Graphkernels/data.zip

⁷https://pubchem.ncbi.nlm.nih.gov/


a reason which shall be discussed later. Recall that, as discussed
in Section 2, the target FWER $\alpha$ is not a parameter,
but a user requirement. By far the most standard choice
across different scientific disciplines is $\alpha = 0.05$, which is the
option we used in this experiment.

The results for significant itemset mining are summarized
in Figure 2. We can see that our algorithm reduces the
runtime by 3 orders of magnitude in 6 out of 20 cases, by 2
orders of magnitude in 3 cases and by 1 order of magnitude
in 5 cases. Even more importantly, for the remaining 6 cases,
corresponding to the datasets Chess, Pumsb-star and Connect,
our comparison partner crashed due to excessive memory
requirements.

Indeed, as far as memory usage is concerned, we see two
very different situations: (1) in 70% of the cases, both methods
use approximately the same amount of memory (up to
the order of magnitude), but (2) in the remaining 30%,
the peak memory usage of FastWY soars up to the point
at which the algorithm simply crashes. The actual memory
usage of FastWY for the cases in which it crashed is
in fact severely underestimated in Figure 2b: the numbers
we plotted are simply a generous lower bound on what the
actual memory usage would have been, obtained with a technique
we will describe shortly. Note also that, for many of
the datasets for which both methods have about the same
peak RAM usage, most of that memory is used by the frequent
itemset miner LCM. The memory used by LCM is a
constant offset for both methods, which explains why the
memory performance is so similar for many datasets. More
importantly, one can see that the memory overhead of our
algorithm scales very gently across datasets, and the overhead
is negligible for about half of the cases. On the
other hand, FastWY shows very poor memory scaling; as
soon as the databases get large and dense (see Chess, Connect
or Pumsb-star), the algorithm completely breaks down.

Another interesting observation is that, despite the fact 
that Westfall-Young light improves the runtime of FastWY 
by only one order of magnitude in 25% of the cases, those 
are precisely the datasets for which the total runtime is quite 
(a) Runtime

Figure 2: Performance comparison between Westfall-Young
light and FastWY in itemset mining with 10,000 permutations.
Missing runtime points for FastWY correspond to
those cases for which the algorithm crashed due to excessive
memory requirements. We include the memory usage due
to LCM, which corresponds to the bare minimum amount of
memory needed to complete these tasks. Numbers attached
to the names of datasets denote the class ratio N/n.


small, below half an hour, except for T40I10D100K. In other
words, the more computationally demanding
the transaction database is, the larger the runtime gap between
Westfall-Young light and FastWY gets. That, together
with the poor scaling of peak memory usage
that FastWY exhibits, makes our proposal a superior
choice for large-scale significant pattern mining.

We can confirm the same trend in subgraph mining from
the results summarized in Figure 3. Westfall-Young
light is orders of magnitude faster than FastWY across all
graph datasets. Moreover, Westfall-Young light reduces the
peak memory usage by one to two orders of magnitude in
8 out of 12 cases. Although Westfall-Young light is always
faster than FastWY, the runtime gap is smaller than in the
itemset mining case. The reason is that, in subgraph mining,
most of the computation is devoted to the mining process,
which again is a constant offset for both methods. In terms
of memory usage, Westfall-Young light has a negligible overhead
for all subgraph datasets; virtually all memory used by
our algorithm is in fact the memory needed by Gaston to
carry out frequent subgraph mining. That is in sharp contrast
with FastWY, which often requires significantly more
memory to store all occurrence lists.

(a) Runtime

(b) Peak memory usage

Figure 3: Performance comparison between Westfall-Young
light and FastWY in subgraph mining with 10,000 permutations.
The memory usage due to Gaston is included.

Complexity analysis 

At this point, one should ponder what causes the wide spread
in runtime between different datasets. By simply comparing
Figures 2 and 3 to Tables 1 and 2, one can see that the
quantities listed in the tables do not correlate with runtime in
a clear way. In this section, we will derive a better predictor
for runtime given the characteristics of a dataset.

Let us define $c(x) = \sum_i \mathbb{1}[x_i = x]$, the number of patterns
which have support $x$ in the database. We call the
quantity $C_k = \sum_{x \in \Sigma_k} x \, c(x)$ the total dataset cost in region
$\Sigma_k$, which is the sum of the supports of all patterns
which are testable in region $\Sigma_k$. As shown in Figure 4, $C_{k^*}$,
with $k^*$ being the value of $k$ when Algorithm 1 terminates,
turns out to be clearly correlated with the total runtime.
We confirmed the same trend in subgraph mining. That
is hardly surprising since the total effort to compute cell
counts is $O(J C_{k^*})$ and, on the other hand, the mining effort
scales with the non-weighted dataset cost $\tilde{C}_k = \sum_{x \in \Sigma_k} c(x)$,
which is naturally related to its weighted counterpart $C_k$.
All other factors contributing to runtime are negligible for
Westfall-Young light, as discussed earlier.
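
As an illustration of the two cost measures, the sketch below computes the weighted cost (the sum of the supports of all patterns inside a testable region) and its non-weighted counterpart (the number of such patterns) from a list of pattern supports. The function name `dataset_costs`, the toy supports and the representation of the region as a plain collection of integer supports are assumptions made for the example only.

```python
from collections import Counter


def dataset_costs(supports, region):
    """Return (weighted, unweighted) dataset cost over a testable region.

    supports: iterable with the support x_i of every enumerated pattern.
    region:   iterable of integer support values considered testable.
    """
    c = Counter(supports)  # c[x] = number of patterns with support x
    weighted = sum(x * c[x] for x in region)   # drives the cell-count effort
    unweighted = sum(c[x] for x in region)     # drives the mining effort
    return weighted, unweighted


if __name__ == "__main__":
    toy_supports = [1, 1, 2, 3, 3, 3, 10]
    # Hypothetical testable region covering supports 2 and 3.
    print(dataset_costs(toy_supports, range(2, 4)))
```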

$C_k$ is not only a good predictor for the runtime but also
an almost exact proxy for the memory overhead of our
comparison partner, FastWY. Since the method in [26] relies
on storing the occurrence list of every frequent pattern in
memory, assuming that each occurrence is stored as a 4-byte
integer, the amount of RAM required for that purpose
would be roughly $4 C_{k_{\min}}$ bytes. Here $k_{\min}$ is the minimum
value that $k^*$ takes among all $J$ permutations in FastWY.
On the contrary, the memory overhead of our algorithm is
completely unrelated to $C_k$, being simply $NJ$ bytes if whole
chars are used to store each binary entry (as we did) or
$NJ/8$ bytes if the matrix is packed.

Figure 4: Execution time (s) versus $C_{k^*}$ for itemset mining
datasets with 10^ repetitions of each case. Both for computational
and visualization reasons, the most demanding
datasets have been left out from the picture, but can be
readily checked to follow the same quasi-linear trend.

As seen in the experiments, for small and medium-size
databases, $NJ$ and $4 C_{k_{\min}}$ appear to be roughly of the
same order of magnitude. This is why both methods exhibit
similar memory requirements in 70% of the itemset and 33% of
the subgraph mining cases. Moreover,
as we indicated previously, the peak memory usage is
actually dominated by the requirements of the frequent pattern
miner for many of those datasets. In more demanding
datasets, $NJ$ scales gently whereas $C_k$ exhibits a combinatorial
explosion. This is why FastWY cannot scale to large
databases, leaving Westfall-Young light as the first method
that can deal with massive, challenging databases.
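
The two memory estimates are simple enough to compare with a back-of-the-envelope calculation. The sketch below assumes 4 bytes per stored occurrence for FastWY and one byte (or one bit, if packed) per entry of the permuted label matrix for Westfall-Young light; the function names and the example dataset cost of 10^10 are hypothetical.

```python
def fastwy_occurrence_bytes(weighted_cost, bytes_per_occurrence=4):
    """Approximate RAM needed to store all occurrence lists, given the
    weighted dataset cost (sum of supports of all stored patterns)."""
    return bytes_per_occurrence * weighted_cost


def wylight_label_bytes(N, J, packed=False):
    """Approximate RAM for the N x J matrix of permuted binary labels."""
    if packed:
        return (N * J + 7) // 8  # one bit per entry, rounded up to whole bytes
    return N * J                 # one char per entry


if __name__ == "__main__":
    N, J = 100_000, 10_000
    print("label matrix (chars): ", wylight_label_bytes(N, J) / 1e9, "GB")
    print("label matrix (packed):", wylight_label_bytes(N, J, packed=True) / 1e9, "GB")
    # A made-up dataset cost, only to show how quickly occurrence lists grow.
    print("occurrence lists:     ", fastwy_occurrence_bytes(10**10) / 1e9, "GB")
```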

Another crucial fact is that, because FastWY treats each
permutation independently and needs to exactly generate
all $J$ samples from $\mathrm{PDF}(\Omega')$, it needs to mine as many
patterns as required by the worst case across all $J$ permutations.
In other words, the runtime depends on $k_{\min}$, which
is the worst-case realisation of $k^*$ across the $J$ permutations.
In practice, $k_{\min}$ can be much smaller than $k^*$ or, equivalently,
the minimum support $\sigma_l^{k_{\min}}$ can be much smaller than
$\sigma_l^{k^*}$. In other words, FastWY consistently needs to mine
frequent patterns with much lower supports than Westfall-Young
light and to store all their occurrence lists in RAM.
Actual numbers are plotted in Figure 5.

This effect is particularly harmful for FastWY as most
databases obey a power-law distribution in $c(x)$, making
patterns with low supports much more abundant in the
database than those with large supports. Consequently, even
small gaps between $\sigma_l^{k_{\min}}$ and $\sigma_l^{k^*}$ can result in large
runtime and storage overheads. In fact, also as a direct result of
the power-law distribution of $c(x)$, the smaller $\sigma_l^{k^*}$,
the more impact the gap between $\sigma_l^{k_{\min}}$ and $\sigma_l^{k^*}$ will have
on the computational demands. This is particularly relevant
for unbalanced datasets, i.e., those with large $N/n$, as they
typically have low values of $\sigma_l^{k^*}$. To summarize, the
computational complexity of FastWY is lower-bounded by the
worst-case scenario out of $J$ permutations, while Westfall-Young
light completely bypasses this ill-posed dependence by
generating only $\lceil \alpha J \rceil$ samples from $\mathrm{PDF}(\Omega')$ exactly.

(a) Itemset mining.

(b) Subgraph mining.

Figure 5: Final support when using Westfall-Young light,
$\sigma_l^{k^*}$, versus the worst-case support in FastWY which
determines its memory and runtime requirements, $\sigma_l^{k_{\min}}$.

As a side remark, the peak memory usage results for
FastWY on 3 datasets (Chess, Pumsb-star, and Connect)
shown in Figure 2b were obtained as $4 C_{k^*}$ bytes (expressed in
MB). As we said, that is a fairly generous lower bound on the
actual memory usage for two reasons: (1) it does not consider
other contributions to the peak memory usage, such as
the RAM needed to run the frequent pattern miner, and (2)
it is obtained from $C_{k^*}$ and not $C_{k_{\min}}$, which can be significantly
larger as we just discussed. Nevertheless, given the
impossibility of actually running the algorithm without crashing,
it is the best approximation we could achieve.


Choosing the number of permutations J and effect of 
addressing the dependence between test statistics on 
the resulting statistical power 

As far as parameters are concerned, we must only deal with
the number $J$ of permutations. Intuitively, the trade-off
involved when setting $J$ is clear: the larger $J$, the more precise
the estimation of $\delta^*$ will be, at the expense of increased
runtime. In problems for which the frequent pattern mining
effort is the dominant runtime factor, such as subgraph
mining, the execution time increases sublinearly with $J$. On
the other hand, in problems for which computing cell counts
is the main bottleneck, like most itemset mining examples,
the increase will be almost exactly linear.

Figure 6: Empirical FWER versus J for two representative
itemset mining databases.

(a) ENZYMES.

(b) NCI220.

Figure 7: Empirical FWER versus J for two representative
subgraph mining databases.

To illustrate the effect of changes in the number $J$ of permutations,
we executed Westfall-Young light for 10 different
values of $J$ between $J = 10^3$ and $J = 10^4$ in steps of
$\Delta J = 10^3$. For each pair (dataset, $J$) we repeat the execution
100 times and show the median empirical FWER as a
function of $J$, along with the corresponding 5%-95% confidence
interval. Furthermore, we also show the resulting
FWER obtained by using the corrected significance threshold
obtained with LAMP, $\delta_{\mathrm{LAMP}}$, which is the first
prominent method that controls the FWER in pattern mining
(see Section 5 for more detail). Note that the FWER
with the Bonferroni factor must always be smaller than that
with LAMP.

The reason why we chose the empirical FWER as the
quantity of interest is that it is the best proxy for the resulting
statistical power of the method. Ideally, an algorithm
for which FWER $= \alpha$ would be optimal among all statistical
testing procedures controlling the FWER at level $\alpha$. On
the other hand, if the resulting empirical FWER satisfies
FWER $< \alpha$, there is a loss of power due to the scheme being
too conservative. When dealing with discrete test statistics,
as in the case of this paper, achieving FWER $= \alpha$ might
not always be possible, yet the reasoning remains the same:
the closer the FWER is to $\alpha$ (without ever being bigger), the
better. To summarize, the purpose of this experiment is
twofold: (1) we can see the effect that correcting for dependence
via permutation testing has on the resulting statistical
power and (2) we can show that increasing $J$ beyond a certain
value has a very small effect on the resulting empirical
FWER and its spread.

We depict the results for four sample datasets due to space
considerations: two datasets in itemset mining (BmsWebview
and T40I10D100K) in Figure 6 and two datasets in
subgraph mining (ENZYMES and NCI220) in Figure 7. As
can be seen, the median FWER appears to be fairly stable
to changes in $J$ and, more importantly, the spread of
the empirical FWER saturates at about $J = 10^4$. Therefore,
we believe $J = 10^4$ to be the safest parameter choice.
Moreover, for runtime-limited scenarios in which cell count
computations are the main bottleneck, using $J = 10^3$ might
still lead to good performance while reducing the runtime
by about one order of magnitude.

Finally, our results also clearly demonstrate that LAMP is
a severely overconservative algorithm. It tends to yield an
empirical FWER which oscillates between $\alpha/2$ and $\alpha/100$
depending on the dataset. Even just halving the target
FWER can have drastic consequences for the resulting statistical
power: when dealing with massive multiple-hypothesis
testing, as in the case of significant pattern mining, picking
up subtle signals is the main objective. A large number of
significant patterns might be lost as a consequence of
overcontrolling the FWER.

5. RELATED WORK 

We briefly describe other existing methods for mining sta¬ 
tistically significant patterns. As discussed in the introduc¬ 
tion, the overwhelming majority of papers that propose al¬ 
gorithms to find significant patterns do not address multiple 
testing. Thus the soundness of their approaches is not re¬ 
liable since they do not provide provable guarantees on the 
probability or proportion of false discoveries being made by 
the algorithm. An alternative approach is to limit the 
maximum size of the patterns being enumerated to reduce 
the overall number of tests, thus reducing the loss of statisti¬ 
cal power due to multiple-testing. However, since the num¬ 
ber of tests (patterns) scales combinatorially, that method 
can deal with only very small patterns before losing all sta¬ 
tistical power. In the following we review only those algo¬ 
rithms which rigorously correct for multiple testing without 
imposing restrictive limits on the pattern size. 

The concept of minimum attainable p-value for discrete
test statistics and its usefulness for multiple hypothesis testing
was first noticed by Tarone. He realized that, given
a tentative corrected significance threshold $\delta$ and just based
on the margins $x$, $n$, and $N$ of the 2x2 contingency table,
one can conclude that all patterns for which $\psi(x_i) > \delta$
have no chance of being significant and can be pruned from
the search space. Even more importantly, because those
patterns can never be significant, they do not need to be
included in the Bonferroni correction factor.

Precisely, let $m(\delta) = |\mathcal{I}_T(\delta)|$ be the number of patterns that are
testable at level $\delta$. Then, Tarone proved that
FWER $\le \delta m(\delta)$ and that an improved corrected significance
threshold can be found as $\delta^* = \max\{ \delta \mid \delta m(\delta) \le \alpha \}$. Since
$m(\delta) \ll D$ in most real-world datasets, this improvement
can bring a drastic increase in statistical power.
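
A small sketch may help to see how this differs from a plain Bonferroni correction. It uses the classical integer formulation of Tarone's procedure: find the smallest $k$ such that at most $k$ patterns have a minimum attainable p-value no larger than $\alpha/k$, and test at level $\alpha/k$, which satisfies $\delta m(\delta) \le \alpha$. The function name `tarone_threshold` and the toy values are assumptions for illustration; in a mining context the list of minimum attainable p-values is never available explicitly.

```python
def tarone_threshold(min_pvalues, alpha):
    """Return (delta, k): the corrected threshold alpha / k and the
    correction factor k found by Tarone's integer search.

    min_pvalues: minimum attainable p-value of every hypothesis (toy input).
    """
    k = 1
    # m(alpha / k) = number of hypotheses that are testable at level alpha / k.
    while sum(p <= alpha / k for p in min_pvalues) > k:
        k += 1
    return alpha / k, k


if __name__ == "__main__":
    toy = [0.001, 0.004, 0.02, 0.3]
    delta, k = tarone_threshold(toy, alpha=0.05)
    print("Tarone factor:", k, "threshold:", delta)
    print("Bonferroni factor:", len(toy), "threshold:", 0.05 / len(toy))
```

In the toy example only two hypotheses are testable at the returned level, so the correction factor is smaller than the total number of hypotheses used by Bonferroni.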

This improved Bonferroni correction for discrete data requires
computing $\psi(x_i)$ for each pattern to evaluate $m(\delta)$
as many times as needed in a linear search. Needless to say,
in a data mining context, that approach is infeasible as it
would require mining and computing the support of every
single pattern in the dataset. The computational infeasibility
of the original method resulted in it being largely ignored
by the data mining community until the 2013 breakthrough
paper by Terada et al. There, the authors propose
the Limitless-Arity Multiple-testing Procedure (LAMP), a
branch-and-bound algorithm that uses a frequent itemset
mining algorithm to efficiently apply Tarone's method to
data mining problems of the form considered in this paper.

On the algorithmic side, a decremental search strategy
was adopted in LAMP, the same as in FastWY. They let $\sigma = n$ at
initialization and solve a sequence of frequent pattern mining
problems, using the miner as a black box to evaluate the number
$m(\sigma)$ of testable patterns for each $\sigma$ until the condition
$\psi(\sigma) m(\sigma) \le \alpha$ is violated for some value of $\sigma$.
When that occurs, one can
conclude that the optimal $\sigma^*$ satisfies $\sigma^* = \sigma + 1$.

LAMP can be seen as the first successful instance of
multiple-testing correction applied to significant pattern mining,
as it can discover significant itemsets of arbitrary size while
upper bounding the FWER. Moreover, it has a reasonable
runtime and storage complexity, at least in small and mid-sized
datasets. In order to tackle larger problems, two studies,
[17] and [22], simultaneously proposed changing the search
strategy to an incremental scheme with some form of early
stopping; the first in the context of subgraph mining and
the latter to deal with itemset mining problems. The basic
idea is to initialize $\sigma = 1$ and iteratively increase $\sigma$ every
time the condition $\psi(\sigma) m(\sigma) \le \alpha$ is found to be violated.
The frequent pattern mining algorithm does not need to be
restarted every time $\sigma$ changes, making the whole process
efficient. Empirically, this new strategy has been shown to
bring a large runtime reduction in both scenarios.
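
The incremental idea can be sketched as follows, under several simplifying assumptions: patterns arrive as a stream of supports from a hypothetical miner, all supports are at most $n$, the minimum attainable p-value is $\binom{n}{\sigma} / \binom{N}{\sigma}$ (one-sided Fisher's exact test), and $m(\sigma)$ counts the patterns seen so far with support at least $\sigma$. The function name `incremental_min_support` is illustrative, and the exact form of the stopping condition differs between published variants.

```python
from math import comb


def incremental_min_support(supports, n, N, alpha):
    """Raise the minimum support sigma while streaming pattern supports,
    keeping psi(sigma) * m(sigma) <= alpha at all times (sketch only)."""
    def psi(x):
        return comb(n, x) / comb(N, x)  # valid for 0 <= x <= n

    sigma = 1
    hist = {}  # hist[x] = number of retained patterns with support x
    m = 0      # number of retained patterns (support >= sigma)
    for x in supports:
        if x > n:
            raise ValueError("this sketch assumes supports of at most n")
        if x < sigma:
            continue  # a real miner would prune this branch instead
        hist[x] = hist.get(x, 0) + 1
        m += 1
        while m > 0 and psi(sigma) * m > alpha:
            m -= hist.pop(sigma, 0)  # patterns at support sigma stop being testable
            sigma += 1
    return sigma


if __name__ == "__main__":
    print(incremental_min_support([5, 4, 3, 6, 2, 5], n=10, N=20, alpha=0.05))
```

Because `sigma` only ever increases, patterns discarded earlier never need to be revisited, which is why the miner does not have to be restarted.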

Nevertheless, neither the conceptual method by Tarone,
its first practical implementation in LAMP, nor the
runtime-optimized versions in [17, 22] address the dependence
structure existing between test statistics. Consequently, they
are strictly suboptimal testing procedures which lose a large
fraction of statistical power by overestimating the FWER
(see Figures 6 and 7). In [22], we tried to alleviate that problem by
exploiting the concept of effective number of tests, estimated
by using classical permutation testing only on the subset of
testable subgraphs. Since the method in [22] does not use
algorithmic tricks to apply permutation testing, its computational
feasibility relies on the set of testable subgraphs
being small enough, which is often the case
in practice. All in all, that idea constitutes a step towards
addressing dependence, but it is still a heuristic, strictly
suboptimal approach, which controls the FWER more strictly than
needed.

To the best of our knowledge, the FastWY algorithm in
[26] is the only attempt so far to optimally take the dependence
between test statistics in pattern mining into account.
However, as discussed earlier, FastWY is based on a
decremental search scheme instead of the more efficient incremental
counterpart. Moreover, as FastWY has to repeat
the decremental search $J \approx 10^4$ times, including the pattern
mining effort, the performance gap between decremental and
incremental search is expected to be even more dramatic.
Furthermore, as we have shown in Section 4, that runtime
gap can still be as high as 3 orders of magnitude even when
very large amounts of RAM are traded off for speed.

6. CONCLUSIONS 

In this paper, we have described a novel algorithm, called
Westfall-Young light, for mining statistically significant patterns
which allows the user to adjust exactly the probability
of having false discoveries. It estimates the null distribution
of the test statistics via Westfall-Young permutations,
and succeeds in overcoming the massive computational cost of
permutation testing in large databases by exploiting a set of
computational tricks.

Empirically, our Westfall-Young light algorithm drastically
improves upon the state of the art: the runtime decreases
by up to three orders of magnitude and the peak
memory usage by one to two orders of magnitude in several
itemset and subgraph mining benchmarks. Moreover,
we also show that the peak memory usage of Westfall-Young
light scales gently with the complexity of the database. In
contrast, the peak memory usage of the state-of-the-art algorithm
soars as the databases get large and dense, thus
breaking down in large-scale problems.

Several interesting challenges still remain to be addressed.
In domains such as computational biology, there is a rising
interest in less conservative statistical testing procedures
which enjoy increased statistical power, such as FDR control.
Another critical problem is how to correct for confounders.
Those are predictive features which are correlated
with both the target response and some of the patterns,
artificially inflating the resulting p-values. Extending the
framework in either of those directions would represent a
very valuable contribution.
Input: Dataset, class labels $y$, number of permutations $J$,
and significance threshold $\alpha$
Output: Corrected significance threshold $\delta^*$

function Westfall-Young Light($\alpha$, $J$)
    for $j = 1, \ldots, J$ do
        $y^{(j)} \leftarrow$ permute($y$)  // Permute class labels $y$
        $p_{\min}^{(j)} \leftarrow 1$
    end for
    $k \leftarrow 1$, $\delta_k \leftarrow n/N$
    $\sigma_l^k \leftarrow 1$, $\sigma_u^k \leftarrow \lfloor N/2 \rfloor$
    flag $\leftarrow 1$
    ProcessNext(root, $N$)  // Patterns are enumerated through a rooted tree
end function

function ProcessNext($i$, $x_i$)
    if $x_i \in [\sigma_l^k, \sigma_u^k] \cup [N - \sigma_u^k, N - \sigma_l^k]$ then
        Compute p-values $p_i(\gamma)$ for all $\gamma \in [a_{i,\min}, a_{i,\max}]$
        for $j = 1, \ldots, J$ do
            Compute $a_i^{(j)}$
            $p_{\min}^{(j)} \leftarrow \min\{p_{\min}^{(j)}, p_i(a_i^{(j)})\}$
        end for
        while $\frac{1}{J} \sum_{j=1}^{J} \mathbb{1}[p_{\min}^{(j)} \le \delta_k] > \alpha$ do
            $k \leftarrow k + 1$
            UpdateThreshold($k$, $\delta_k$, $\sigma_l^k$, $\sigma_u^k$)
        end while
    end if
    for $j \in$ Children($i$) do
        Compute $x_j$
        if $x_j \ge \sigma_l^k$ then
            ProcessNext($j$, $x_j$)
        end if
    end for
    Return $\delta^* \leftarrow \delta_k$
end function

function UpdateThreshold($k$, $\delta_k$, $\sigma_l^k$, $\sigma_u^k$)
    if flag = 1 then
        if then

        else
            flag
        end if
    else